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Abstract 

We review quantitative methods and software developed to analyze genome-scale, 
brain-wide spatially-mapped gene-expression data. We expose new methods based 
on the underlying high-dimensional geometry of voxel space and gene space, and on 
simulations of the distribution of co-expression networks of a given size. We apply 
them to the Allen Atlas of the adult mouse brain, and to the co-expression network 
of a set of genes related to nicotine addiction retrieved from the NicSNP database. 
The computational methods are implemented in BrainGeneExpressionAnalysis, a 
Matlab toolbox available for download. 
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1 Introduction and background 

The mammalian brain is a structure of daunting complexity, whose study started millenia 
ago and has been recently renewed by molecular biology and computational imaging [1]. The 
Allen Brain Atlas, the first Web-based, genome-wide atlas of gene expression in the adult 
mouse brain, was a large-scale experimental effort [2] @J EJ [6] [7]. The resulting dataset 
consists of co-registered in situ hybridization (ISH) image series for thousands of genes. It 
is now available to neuroscientists world-wide, and has given rise to the development of 
quantitative techniques and software for data analysis. The present paper reviews recent 
developments that have been applied to co-expression studies in the mouse brain and are 
publicly available for use on the Web [8] and on the desktop [9]. 

On the other hand, lists of condition-related genes are now available from databases that 
pool results of different studies [101 E] • As these studies employ different methods and result 
in lists of hundreds of genes, it is important to investigate any possible order (or lack of it) 
in these lists. The Allen Brain Atlas provides ways to do this, by stuying brain-wide co- 
expression of genes, and by enabling to compare gene expression to classical neuroanatomy, 
in a genome-scale dataset based on a unified protocol. 
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Advanced data exploration tools have already been developed for the Allen Brain Atlas. 
NeuroBlast allows users to explore the correlation structure between genes in the ABA. was 
inspired by the Basic Local Alignment Research Tool [12J, which derives lists of similar genes 
to a given gene at the level of sequences, and transposed the technique to the analysis of 
similarity between patterns of gene expression in the brain [T3] . The Anatomic Gene Expres- 
sion Atlas [H] was launched in 2007. It is based on the spatial correlation of the atlas. The 
user can explore three-dimensional correlation maps based on correlations between voxels, 
computed using thousands of genes, and retrieve hierarchical data-driven parcellations of the 
brain. 

The Weighted Gene Co-Expression Network Analysis framework (WGCNA) has been 
used to isolate clusters of genes from correlations between multiple microarray samples. In 
this approach the gene networks are typically constructed from the correlation coefficients 
of microarray data, from which graphs are constructed and thresholded at a value chosen 
as as to satisfy certain statistical criteria [T5t [16] . However, in the case of the Allen Brain 
Atlas, gene-expression data are scaffolded by classical neuroanatomy, since ISH data are co- 
registered to the Allen Reference Atlas (ARA) [T7j. The whole brain is voxelized, and the 
voxels are are annotated according to the brain region to which they belong, which allows 
to compare the expression of sets of genes to brain regions (see Figure [6] and [IE]). Hence 
we developed computational methods to: 

1. study the whole range of co-expression values between pairs of genes; 

2. use the Allen Atlas as a probabilistic universe to estimate the distribution of co-expression 
networks; 

3. compare the expression patterns of highly co-expressed sets of genes to classical neu- 
roanatomy. 

These methods are implemented in BrainGeneExpressionAnalysis (BGEA), a Matlab 
toolbox downloadable from www.brainarchitecture.org. They are applied to a set of 288 
genes extracted from the NicSNP database, which have been linked to nicotine dependence, 
based on the statistical significance of allele frequency difference between cases and controls, 
and for which mouse orthologs are found in the coronal Allen Atlas. 



2 Methods 

The spatial frequency of tissue-sectioning in the experimental pipeline of the Allen Brain 
Atlas corresponds to slices with a thickness of 100 micrometers. Each section was registered 
to a grid with a resolution of 100 microns [201 [21] . The induced three-dimensional grid was 
sub-sampled to a resolution of 200 microns in order to increase the overlap between different 
experiments. This procedure results in a partition of the mouse brain into V = 49, 742 
cubic voxels. We focus on the co-registered quantities obtained at a spatial resolution of 200 
micrometers, for several thousands of genes, after subsampling. 
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In particular, the expression energy of each gene labelled g in the Atlas was defined and 
computed [H] at each voxel labelled v in the mouse brain: 

E{v,g) = = , (1) 

where p is a pixel index, and the denominator counts the pixels that are contained in the 
voxel v for the ISH image series of gene g. The quantity M(p) is a Boolean segmentation 
mask that takes value 1 at pixels classified as expressing the gene, and at other pixels. 
The quantity I(p) is the grayscale value of the pixels in ISH images. The present paper 
uses the voxel-by-gene matrix of expression energies E as the digitized version of the Allen 
Brain Atlas. The expression energies of the genes in the full coronal and sagittal atlas can 
be downloaded using the Web service provided by the Allen Institute [22]. 



2.1 Brain-wide co-expression networks: graph properties 

The statistical study of brain-wide co-expression networks using BGEA is summarized in 
the flowchart of Figure [1} Detailed examples of the use of the software are provided in tool- 
box manual [9]. 

The columns of the matrix E of expression energies of Equation [T] are naturally identified 
to vectors in a V^-dimensional space (the voxel space). Given two genes, the two correspond- 
ing columns of the matrix E span a two-dimensional vector of voxel space. The simplest 
geometric quantity to study for this system is the angle between the two vectors. As all the 
entries of the matrix E are positive by construction, this angle is between and tt/2. The 
angle between the two vectors is therefore completely characterized by its cosine, which is 
readily expressed in terms of expression energies. This cosine similarity, defined in Equation 
|2j for genes labelled g and g', is called the co-expression of genes g and g'. 



P / a E(v,g)E(v,g f ) , , 

coExpr(3, g ) = > - (2) 

\/E V u =iE(u } gyZ V w=1 E(w } gy 
The more co-expressed g and g' are in the brain, the closer their cosine similarity is to 1. 

Once the co-expressions have been computed for all pairs of genes in the Allen Brain 
Atlas, they are naturally arranged in a matrix, denoted by C atlas , with the genes arranged 
in the same order as the list of genes in the atlas: 

C atUs (g, gf) = coExpr(s, g') l<g,g'< G atlas , (3) 

where G a tias is the total number of genes included in the dataset (see the next section for 
more details on this choice). The matrix C atlas is symmetric and its diagonal entries are all 
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Figure 1: The flowchart of computational analysis of the collective neuroanatomical proper- 
ties of a set of genes in the BrainGeneExpressionAnalysis toolbox. The steps marked as 
random extractions and computations are described in supplementary materials. 
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equal to one. This diagonal is trivial in the sense that it expresses the perfect alignment of 
any vector in voxel space with itself. When we consider the distribution of the entries of the 
co-expression matrix, we really mean the distribution of the upper-diagonal coefficients. 

Given a set of genes (with G set elements) curated from the literature, possibly coming 
from different studies, one may ask if the brain-wide expression profiles of these genes (or a 
subset thereof) are closer to each other than expected by chance, using the full atlas as a 
probabilistic universe. The set of genes for which brain-wide expression data are available 
from the Allen Atlas of the adult mouse brain consists of 4,104 genes, which is of the same 
order of magnitude as the total number of genes in the mouse genome. The number of sets of 
genes of a given size that can be drawn from the atlas therefore grows quickly with the size 
of the set. To study the co-expression properties of the chosen set of genes, a Gget-by-Gset 
matrix C set can be extracted from the whole co-expression matrix C atlas . A set of strongly 
co-expressed genes corresponds to a matrix C set with large coefficients . To formalise this 
idea, we propose to study the matrix in terms of the underlying graph. There are Galas' 
ways of ordering the genes in the Atlas. They give rise to different co-expression matrices, 
related by similarity transformation. But the sets of highly co-expressed genes are invariant 
under these transformations. The co-expression matrix can be mapped to a weighted graph 
in a straightforward way. The vertices of the graph are the genes, and the edges are as 
follows: 

- genes g and g' are linked by an edge if their co-expression coExpr(<7, g') is strictly positive. 

- If an edge exists, it has weight coExpr(g, g'). 

We have to define large co-expression matrices in relative terms, using thresholds on the 
value of co-expression that describe the whole set of possible values. The entries of the co- 
expression matrices are numbers between and 1 by construction. We define the following 
thresholding procedure on co-expression graphs: given a threshold p between and 1, and 
a co-expression matrix (which can come from any set of genes in the Allen Atlas), put to 
zero all the coefficients that are lower than the threshold (see Figure [2] for an illustration on 
a toy-model with 9 genes). 

The graph corresponding to a co-expression matrix has connected components, and each 
connected component has a certain number of genes in it. The graph properties of C sct 
can be studied by computing the average and maximal size of the connected components at 
every value of the threshold. This induces functions of the threshold that can be compared 
to those obtained from random sets of genes of the same size G set (these computations 
on random sets of genes correspond to the two green boxes in Figure [TJ see Supplementary 
Materials S2 for mathematical details). 
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Figure 2: A toy model with 9 genes, and only three distinct values of co-expression, 0, 0.6 
and 0.9, for simplicity. Before any thresholding procedure is applied (on the left-hand side of 
the figure), there is one connected component. The average and maximum size of connected 
components are both 9. The graph on the right-hand side is obtained by a thresholding 
procedure at a threshold of 0.6. There are three connected connected components, the 
maximal size is 4, and the average size if 3. 



7 



2.2 Cumulative distribution functions of co-expression 

To complement the graph-theoretic approach, we can study the cumulative distribution func- 
tion of the entries of the co-expression matrix of the set of genes to study, and compare it to 
the one resulting from random sets of genes of the same size (see Supplementary Materials 
S2 and S3 for mathematical details). For every number between and 1, the empirical 
cumulative distribution function of C set , denoted by CDF set is defined as the fraction of the 
entries of the upper-diagonal part of the co-expression matrix that are smaller than this 
number. 

To compare the co-expression network of interest C sct to random networks of the same 
size, the procedure is exactly the same as with the thresholded matrices, except that the 
quantities computed from the M random draws are cumulative distribution functions rather 
than connected components (see Supplementary Materials for mathematical details). For 
each random set of G set genes drawn from the Allen Atlas, one can compute the empirical 
distribution function of the corresponding submatrix of C atlas , and average over the draws. 
The average over the draws converges towards the one of a typical network of G set genes 
when the number of random draws is sufficiently large. 



2.3 Comparison to classical neuroanatomy 

Given a brain region ou r , 1 < r < R, where R is the number of brain regions in the Allen 
Reference Atlas [Ej (to which gene expression data are registered), the fitting score of a 
brain- wide function / in this region, or (p r (f) can be defined [18J as the cosine distance 
between this function and the characteristic function of the region. It is formally the same 
as the co-expression of a gene whose expression energy would be the brain-wide function, 
and a another gene that would be uniformly expressed in the region, and nowhere else: 



The distribution of fitting scores in all the brain regions for sets of G set genes can be simu- 
lated by the Monte Carlo methods described in Supplementary Materials S4. 

Even though clustering methods [19] have shown that the correspondence between large 
sets of genes and brain regions in the Allen Atlas is not perfect, it is possible to detect 
small subsets of a set of genes curated from the literature to have exceptionally good fitting 
properties in some brain regions (see Figure [I] for an example of a set of 3 genes detected to 
fit the striatum significantly better than expected by chance). 





8 



3 Applications 



3.1 Choice of genes: coronal and sagittal atlases 

The notion of an atlas of gene expression in the adult mouse brain rests on the assump- 
tion that there is a constant component across all brains at the final stage of development 
(the developmental atlas adresses the challenge of measurement of this component at earlier 
stages [23] )• For an account of the standardization process that began in 2001 and led to 
the data generation and release of the Allen Brain Atlas, see [6]. 

The issue of reproducibility of ISH data can been addressed in several ways during the 
analysis of data. In NeuroBlast, the user can specify a given image series as input. The 
BrainGeneExpressionAnalysis toolbox (BGEA) is based on the analysis of the matrix 
of expression energies [TJ whose columns consist of brain-wide gene-expression data. This 
restricts the choice of genes to be analyzed in by BGEA to the 4,104 genes for which 
a brain-wide, coronal atlas was developed. For these genes, sagittal, registered data are 
also available in the left hemisphere. We computed correlation coefficients between sagittal 
and coronal data. The left-right correlation coefficients are not all positive. Sagittal datasets 
usually come from brain sections taken from the left hemisphere only. Hence the computation 
of correlation between (co-registered) sagittal and coronal data has to be restricted to the 
voxels belonging to the left hemisphere. For each gene g in the coronal atlas, we computed 
the following correlation coefficient between sagittal and coronal data: 

. . _ Sueleft hemisphere -^sagittal ( v , d) E corona _l(v , g) 

Psagittal/coronall^J — / = ; \P) 

A/E^elcft hemisphere -^sagittal (v, d) 2 X/ugleft hemisphere -^coronal ( v , d) 2 

where Sagittal an d -^coronal are the voxel-by-gene matrices of Equation [T] for sagittal and 
coronal data respectively. The results are shown on Figure |3j Some genes have negative 
correlation between sagittal and coronal data. The gene with highest value of p S agittai/coronai 
is Tcf7l2. The present study focuses on genes for which the correlation is larger than the 
25th percentile of the distribution of Psagittai/coronai- 

This set of G Atlas := 3, 041 genes serves as a reference set to which special sets of genes 
can be compared using the methods described above. In particular, this choice excludes all 
the genes with negative correlation. Other user-defined choices of genes are possible within 
the coronal atlas. They can be implemented by modifying the data matrix [T] and the list of 
genes corresponding to its columns in BGEA. 

The sorted entries of the upper-diagonal part of the induced co-expression matrix C atlas 
are plotted on Figure 131(a). The pair of genes with highest co-expression are Atp6v0c and 
Atp2a2, whose expression profiles are plotted on Figures |3](b,c). The profile of co-expressions 
is fairly linear, except at the end of the spectrum, which motivates a uniform exploration of 
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Figure 3: Sorted correlation coefficients between expression energies evaluated from sagittal 
and coronal sections in the left hemisphere of the mouse brain. 
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the interval [0, 1] when studying co-expression networks (see the pseudocode in Supplemen- 
tary Materials S2). 

3.2 Application to a set of addiction-related genes 

The methods reviewed above were applied to a set of 288 genes related to nicotine addiction 
retrieved from the NicSNP database^] The simulation of the cumulative distribution 
function of co-expression networks of size 288 can be compared to the one of the special set, 
and plotted together on [5j Since the CDF of the special sets is larger than average at low 
values of co-expression, the special set is not more co-expressed as a whole than expected 
by chance. This is confirmed by the statistics of graph properties of networks of 288 genes 
(Figures [6] and [7]). See [Mj for a set of autism- related genes that is more co-expressed in the 
brain than expected by chance). 



http : //zork . wustl . edu/nida/Results/datal .html 
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Figure 4: (a) Sorted elements of the upper-diagonal part of the co-expression matrix of the 
coronal atlas, C atlas . (b) Maximal-intensity projection of the expression energy of Atp6v0c. 
(c) Maximal-intensity projection of the expression energy of Atp2a2. The pair of genes 
(Atp6v0c,Atp2a2) has the highest co-expressldn in the coronal atlas, 0.9781. 
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Figure 5: Cumulated distribution function of the upper-diagonal entries of the co-expression 
matrix of 288 genes (the special co-expression network C set of the flowchart [T| from the 
NicSNP database, for which mouse orthologs are found in the Allen Atlas of the adult 
mouse brain. As the red curve (or CDF set , Equation 15) sits above the simulated average of 
the simulated mean of CDFs (or (CDF), Equation 17) of co-expression networks of 288 genes, 
at low values of the threshold, the special set as a whole appears to be less co-expressed than 
expected by chance. 
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Figure 6: Monte Carlo analysis of the graph underlying the co-expression matrix 
of 288 genes from the NicSNP database. Average and maximum size of connected com- 
ponents as a function of the threshold (the quantities A(p) and JA(p) defined in Equations M 
and M are plotted in red circles and red stars respectively, the quantities (A(p)} and (Ai(p)) 



defined in Equations 11 and 10 are plotted in cyan circles and cyan stars respectively). 
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Figure 7: Monte Carlo analysis of the graph underlying the co-expression matrix 
of 288 genes from the NicSNP database. Estimated probabilities for the average and 
maximum size of connected components to be larger than in random sets of genes of the 
same size (the quantities defined in Equations 12 



and 13 respectively). 



However, the graph-based procedure returns special sets when the threshold on co- 
expression goes from 1 to 0, that may have exceptional neuroanatomical properties com- 
pared to sets of the same size, even if this does not affect the distribution of average and 
maximal size of connected components. For each of the connected components, the sum of 
expression energies can also be compared to the partition(s) of the brain given by the ARA, 
inducing fitting scores in each brain regions (see Supplementary Materials S4 for mathemat- 
ical details). The probability for each connected component of thresholded co-expression 
networks to have a larger fitting score in a given brain region can be estimated. Imposing 
a threshold on this probability (99% for instance) returns sets of genes with exceptional 
anatomical properties. For the coarsest partition of the left hemisphere, a small set of 3 
genes (Rgs9, Drd2, Adora2a) connected at a co-expression of 0.9, is in the 99th percentile 
of fitting scores in the striatum (see Figure [8] for a bar diagram of the estimate of P- values 
of fitting scores, and a maximal-intensity projection of the sum of the expression energies 
of these genes). Even though this set of genes is not exceptional in terms of its size at this 
value of the co-expression threshold, it has exceptional anatomical properties. 
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Figure 8: One of the connected components of the co-expression network, at co-expression 
threshold 0.9. It is better-fitted to the striatum (STR) than more than 99% of the set of three 
genes drawn from the coronal Allen Atlas of the adult mouse brain. The symbols for other 
brain regions read as follows: Basic = 'basic cell groups', CTX = cortex, OLF = olfactory 
areas, HIP = hippocampal region, RHP = retrohippocampal region, PAL = pallidum, TH 
= thalamus, HY = hypothalamus, MB = midbrain, P = pons, MY = medulla, CB = 
cerebellum. 
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4 Conclusion and outlook 



The restriction of the first release of the BrainGeneExpressionAnalysis toolbox to the 
coronal atlas of the adult mouse brain corresponds to a restriction to genes for which brain- 
wide data are available. However, the sagittal atlas of the adult mouse brain contains more 
than 20, 000 genes, which are included in the Neuroblast and AGEA tools. The second 
release of BGEA will include these genes and restrict the Allen Reference Atlas to voxels 
where all the genes have ISH data (these voxels correspond to the left hemisphere of the 
brain). It would also be interesting to estimate the variability of the results under changes 
of probabilistic universe C* atlas (by substuting the sagittal atlas to the coronal atlas, and by 
choosing different image series to construct the data matrix). 

Furthermore, the development of large-scale neuroscience is making comparable atlases 
available to the research community for other species (see |25j for the Allen Atlas of the 
human brain, and for ZEBrA, the Zebra Finch Expression Brain Atlas), and the de- 

velopment of computational resources for the analysis of large datasets can be adapted from 
the Allen Atlas of the adult mouse brain to other atlases, allowing insights into evolution 
and into the validity of animal models (|28j). 

Moreover, the size of voxels in the Allen Brain Atlas is large in scale of brain cells, and 
each voxel may contain cells of different types. Several studies [321 EH ESI EHl ED E21 EH] have 
obtained cell-type-specific transcriptional profiles using microarray experiments. Comparison 
between ISH and microarray data is an ongoing challenge [36], and steps were taken in [37] 
to estimate the brain-wide density profiles of cell types by combining the Allen Atlas to the 
transcriptional profiles of cell types. This sheds light on the cellular origin of co-expression 
brain-wide co-expression patterns of genes. The corresponding Matlab code will be included 
in the second release of BGEA. 
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6 Supplementary Materials 



6.1 SI: Co-expression networks, graph properties 

Consider a set of genes of size G se t, as in the ellipse on the left-hand-side of the flowchart 
[T] They correspond to indices (gi, ■ ■ ■ , gc sat ) m t ne columns of the voxel-by-gene matrix 
E of expression energies. We can construct the co-expression matrix by extracting the 
coefficients of the co-expression matrix of the atlas corresponding to these genes. Let us 
denote this matrix by C set : 

C sct (t, J ) = C sct (g l ,g J ). (6) 

After applying the thresholding procedure, the co-expression matrix is mapped to a 
matrix C s p et : 

c s ;\t,j) = c sct (t,j) x i(c sct (*, j) > P ). (7) 

Then for every integer k between 1 and G se t we can count the number N p (k) of connected 
components of C p that have exactly k genes in them (using Tarjan's algorithm [38] . im- 
plemented as the function graphconncomp .m in Matlab). We can study the average size 
of connected components of thresholded co-expression networks and the size of the largest 
connected component: 

A(n) _ Etl kN p (k) 

M{p) = max {k e [l..G sct ], N p (k) > 0} , (9) 

as a function of the threshold p. -4.(0) and Ai(0) both equal the size of the set of genes, as 
the whole set is connected before any thresholding procedure is applied. At large thresholds 
every single singe is disconnected from the other genes, as having co-expression equal to one 
is equivalent to having exactly the same expression across the whole brain. So at threshold 
1 all the connected components have size one, and -4.(1) — M.{1) = 1. 



6.2 S2: Monte Carlo study of gene networks 

We would like to compare the properties of the matrix C sct to the ones of C atlas . In order to 
eliminate the sample-size bias, we are going to study some properties of the graph underlying 
C set , and to compare them to the properties of the graphs underlying submatrices of (7 atlas 
of the same size, G sct . 

To explore the graph property of the gene network, we have to choose a discrete set 
of thresholds regularly spaced between and 1, and to apply the procedure of Equation [7] 
usuing each of these thresholds. Call M the number of random sets of genes to be drawn. 
The computations can be described as follows in pseudocode: 

1. Choose a number of thresholds T to study. 

2. Choose a number of draws M to be performed for each value 



22 



of the threshold; 

3. For each integer t between 1 and T: 
3. a. consider the threshold pt — 4; 

3. b. compute the connected components of the thresholded matrix , as defined 
in Equation g; call M se \p t ) the size of the largest connected component, and 
A T (pt) the average size of connected components; 

4. for each integer n between 1 and J\f: 

draw a random set of distinct indices of size G sct from [1..G], 

extract the corresponding submatrix of C atUs ; 

call it C n , and repeat step 3 after substituting C n to C set ; 

call A4 n (p t ) the size of the largest connected component of C^ t , and A n (pt) the 
average size of connected components. 

At each value p of the threshold, we therefore have: 

- the size of the maximal connected component .M set (p), 

- a distribution of M numbers, each of wchich is the size of the largest connected components 
of a random submatrix of the same size as the set of genes to study, thresholded at p. 

When the number of random draws is sufficiently large, we can estimate the means of 
the average and maximum sizes of connected components: 

1 M 

n=l 

1 N 

n=l 

We can study where in Ai set (p) (resp. A sct (p)) sits in the distribution and estimate the 
probabilities of A4 sct (p) and A set (p) being larger than expected by chance: 

1 N 

Pla rgCr (^ Ct (p)) = jj 1 ^ • ( 12 ) 

n=l 

1 M 

P^ gC r(M sct (p)) = ^ E 1 i Mm *(p) * M«(p)) , (13) 

n=l 

6.3 S3: Cumulative distribution functions (CDFs) 

Given the G set -by-G set co-expression matrix C set , consider the coefficients above the diagonal 
(which are the meaningful quantities by construction) and arrange them into a vector C vec 
with N = G set (G sct - l)/2 components: C vec = {C gh }x< g < Gset>h>g . The components of this 
vector are numbers between and 1. For every number between and 1, the cumulative 



23 



distribution function of C, denoted by CDF set is defined as the fraction of the components 
of C vcc that are smaller than this number: 

CDF sct : [0, 1] -> [0, 1] (14) 
1 N 

x ^ ^$^ 5 c voc (fc)<x- (15) 
k=\ 

For any set of genes, CDF sct is a growing function CDF spccial (0) = and CDF spccial (l) = 1. 
For highly co-expressed genes, the growth of CDF special is concentrated at high values of 
the argument (in a situation where all the genes in the special set have the same brain- 
wide expression vector, all the entries of the co-expression matrix equal 1). To compare the 
function CDF special to what could be expected by chance, let us draw J\f random sets of G spC ciai 
genes from the Atlas, compute their co-expression network by extracting the corresponding 
entries from the full co-expression matrix of the atlas ((7 atlas ). This induces a family of M 
growing functions CDFj, 1 < % < R on the interval [0, 1]: 

Vl<i<jV, CDF, : [0,1] -> [0,1]. (16) 

From this family of functions, we can estimate a mean cumulative distribution function 
(CDF) of the co-expression of sets of G spec iai genes drawn from the Allen Atlas, by taking 
the mean of the values of CDF, across the random draws: 



Vxg[0,1], (CDF)(x) = -i^CDF l (x). (17) 

i=i 

Standard deviations CDF dcv of the distribution of CDFs are estimated as follows on the 
interval [0,1]: 



Vxe[0,l], CDF dev (a;) 



1 N 

^(CDF^aO-fCDFK*))' 



(18) 



6.4 S4: Comparison to classical neuroanatomy 

Consider a system of annotation in the voxelized version of the ARA. Let Vl be the set of 
voxels in the annotation, let R be the total number of regions in the annotation, and let 
{oji, . . . ,oor} be the regions in the annotation. For the sake of simplicity, the present paper 
focusses on the coarsest annotation, for which Q is the left hemisphere, and R = 13. 



24 



For a set of G set genes, labelled {gi, . . . , ga act } i n the atlas, the total brain- wide expression 
energy is 

Gsct 

E set (v,{ gi ,...,g Gsct }) = J2E(v,g k ). (19) 

k=l 

Using the same Monte Carlo procedure as in supplementary S2, we draw M sets of genes 
from the atlas, and compute the total gene-expression energy defined by Equation [20] for 
each of these sets. Hence for each random draw (labelled by an integer n in [1.JV]) one has 
a set of genes labelled {g™, . . . , }, and the corresponding brain- wide sums of expression 
energies 

E^(v, K, . . . , g n Gs J) = £ E(v, gl). (20) 

k=l 

The fitting scores to each region in the ARA can be computed both for E set and for the 
each of the random sets of G se t genes: 

<C ■= ME**), (21) 
^rand,n ;= ^^rand.n) ^ 

Hence, one can estimate the position of the fitting score 0^ et in the distribution of fitting 
scores by evaluating the folllowing fraction: 

1 N 

p r et = ji E 1 (<f>r > 0r d>n ) (23) 

n=l 

which goes to the probability for <ft s r et being larger than expected by chance for a sum of G se t 
expression energies. A histogram of 23 is plotted on Figure [8j for the regions of the coarsest 
annotations of the left hemisphere, and for a set consisting of Rgs2, Drd2 and Adora2a. 



25 



